ABSTRACT

Large numbers of mass spectrometry proteomics 
studies are being conducted to understand all 
types of biological processes. The size and
complexity of proteomics data hinder efforts to easily
share, integrate, query and compare the studies.
The Model Organism Protein Expression Database
(MOPED, http://moped.proteinspire.org) is a new
and expanding proteomics resource that enables 
rapid browsing of protein expression information 
from publicly available studies on humans and 
model organisms. MOPED is designed to simplify 
the comparison and sharing of proteomics data for 
the greater research community. MOPED uniquely 
provides protein level expression data, meta- 
analysis capabilities and quantitative data from 
standardized analysis. Data can be queried for 
specific proteins, browsed based on organism, 
tissue, localization and condition and sorted by 
false discovery rate and expression. MOPED 
empowers users to visualize their own expression 
data and compare it with existing studies. Further, 
MOPED links to various protein and pathway data- 
bases, including GeneCards, Entrez, UniProt, KEGG 
and Reactome. The current version of MOPED 
contains over 43000 proteins with at least one 
spectral match and more than 11 million high cer- 
tainty spectra. 

INTRODUCTION 

Protein expression, the presence or quantity of a protein in
a biological sample, is one of the key measures essential
for understanding biological processes. The data serve as a
snapshot of the state of an organism at the time of sample
collection. Notably, aberrant protein expression patterns
in disease states may be indicative of the misregulation
associated with the disease. MOPED
(http://moped.proteinspire.org) was motivated, in part, by the idea
that easy public access to protein expression data will
enable scientists to better identify and understand
protein expression patterns that are related to significant
diseases and biological processes.

Mass spectrometry-based proteomics is the most 
common approach used to survey complex samples for 
the presence of proteins and their expression (1,2). To 
provide ample context for the data contained in 
MOPED, we briefly describe a proteomics workflow. 

Prior to analysis by mass spectrometry, proteins are 
typically digested into their peptide components. Search 
engines such as Sequest, Mascot, X!Tandem and OMSSA
match the spectra generated by tandem mass spectrometry 
with peptides from a target protein sequence database 
(3-6). Due to the highly complex nature of protein 
samples and their processing, as well as mass spectrometry 
instrumentation, approaches and analysis, peptide 
spectral matches are associated with varying degrees of 
uncertainty (7-9). Once peptide spectral matches are 
formed, the peptides are amalgamated into protein iden- 
tifications with associated measures of statistical certainty. 
Commonly, peptide spectral matches are performed 
against decoy databases generated by reversing or 
randomizing the target database to estimate the false dis- 
covery rate (FDR) associated with protein and peptide 
identifications (10,11). 
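
The target-decoy estimate described above can be sketched as follows. This is a minimal illustration, assuming each peptide spectral match carries a search-engine score (higher is better) and a flag marking decoy hits. The function name `decoy_fdr` and the simple decoys-over-targets ratio are assumptions made for illustration; they are not the specific procedure of any search engine or of the SPIRE pipeline.

```python
from typing import Iterable, Tuple


def decoy_fdr(matches: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the FDR among matches scoring at or above `threshold`.

    Each match is (score, is_decoy). The estimate is the number of decoy
    hits divided by the number of target hits that pass the threshold,
    capped at 1.0.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in matches:
        if score < threshold:
            continue  # match does not pass the score cut-off
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        # No target hits: report 1.0 if only decoys passed, else 0.0.
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / targets)


# Hypothetical scores: four target hits and one decoy hit pass a threshold of 2.0.
example = [(3.1, False), (2.8, False), (2.5, True), (2.2, False), (2.0, False), (1.0, True)]
print(decoy_fdr(example, 2.0))
```

Raising the threshold trades sensitivity for certainty: fewer matches pass, but a smaller fraction of them are expected to be false.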

From these searches, estimates of protein expression 
can be determined by using measures such as spectra 
counts (the number of identified spectra which correspond 
to a specific protein), sequence coverage and peak areas or 
intensities (12,13). Expression in mass spectrometry proteomics
experiments can be measured dichotomously in
terms of the certainty of a protein being present or with
quantitative measures that reflect the protein's concentration.
Relative expression measures are used for comparing
the relative amounts of the same protein across different
conditions. Absolute expression, the quantification of
different proteins within the same sample, is difficult to
measure, in part due to variability in individual protein
responses to mass spectrometry assay methods.

A number of websites provide hosting services for massive
proteomics datasets (14-17). Although these repositories 
are excellent resources for accessing raw data and quick 
experimental summaries, they neither provide protein ex- 
pression data, nor do they allow for a standardized com- 
parison of expression levels across tissues, localizations 
and conditions. Furthermore, the extreme scale of data 
in these repositories makes meta-analysis and even 
simple querying of these datasets a staggering challenge, 
often worthy of its own publication (18,19). Such 
meta-analysis typically requires the download of raw 
data, whose volume is often measured in terabytes, and 
analysis of these data through a computationally intensive 
proteomics workflow. In cases where summary informa- 
tion is available, these data may be in varying formats, 
have been processed through non-standard pipelines and 
often provide limited or non-comparable statistical 
measures of protein identification certainty. Additionally, 
proteome profiles from other resources omit the relevant 
expression information (20). 

The aforementioned challenges hinder the utilization of 
publicly available proteomics data. Enabling researchers 
to access these data in an effective manner is an important 
challenge in proteomics. MOPED complements the avail- 
ability of raw data from other resources by presenting 
standardized data analysis and enabling the user to view 
experimental data relative to existing expression pro- 
files across many different tissues, localizations and con- 
ditions (21). 

Where there are multiple experimental datasets for a
given combination of organism, tissue, localization and
condition, a meta-analysis is provided based on a
recently published approach (18). The simple format of
the MOPED data and the straightforward approach to
meta-analysis allow for the uncomplicated combination
of proteomics datasets. These features and comparisons
empower the user to make meaningful statements
about identified proteins with respect to the existing
knowledge base.

DATABASE CONTENT 

Expression data 

The core component of MOPED's database is the reposi- 
tory of expression information from public proteomics 
datasets. By storing and displaying essential summary in- 
formation without requiring the user to download any 
files, MOPED simplifies access to the proteomics data. 
To maintain statistical integrity, MOPED requires that 
statistical measures be provided for each protein identifi- 
cation, including the protein FDR and spectral counts. 
A full list of required measures is found in Table 1. 



Table 1. The fields required for each protein expression data point in MOPED

Measure                  Definition
Expression percentile    The percentile (0-100%) corresponding to the protein expression level in this experiment
Normalized expression    Number of spectra counts divided by sequence length, normalized to the maximum expression value in the experiment (0-1)
FDR                      Cumulative FDR threshold for protein identification
Spectral count           The number of unique spectra identified which correspond to the identified proteins
Unique peptides          Number of unique peptide sequences identified
Sequence coverage        Percentage of the protein sequence covered by identified peptide sequences

Users may submit data to MOPED by providing either 
raw files or pre-processed data. Currently, all data dis- 
played in MOPED were analyzed using the standardized 
data analysis and statistical methods of the SPIRE 
pipeline (21,22). 
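
As an illustration of the two expression measures defined in Table 1, the sketch below computes a normalized expression value (spectral count divided by sequence length, scaled to the experiment maximum) and an expression percentile. The function names are hypothetical, and the rank-based percentile (the share of proteins at or below a given value) is an assumption, since the table only states that the percentile runs from 0 to 100%.

```python
from typing import Dict, Tuple


def normalized_expression(proteins: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map protein ID -> (spectral count / sequence length), scaled to 0-1.

    `proteins` maps a protein ID to (spectral_count, sequence_length).
    """
    raw = {pid: count / length for pid, (count, length) in proteins.items()}
    if not raw:
        return {}
    top = max(raw.values())
    return {pid: (value / top if top > 0 else 0.0) for pid, value in raw.items()}


def expression_percentile(normalized: Dict[str, float]) -> Dict[str, float]:
    """Percentage of proteins whose expression is at or below each value.

    Assumed definition for illustration only.
    """
    values = list(normalized.values())
    n = len(values)
    return {
        pid: 100.0 * sum(1 for v in values if v <= value) / n
        for pid, value in normalized.items()
    }


# Hypothetical experiment: (spectral count, sequence length) per protein.
experiment = {"A": (100, 200), "B": (50, 400), "C": (10, 100)}
norm = normalized_expression(experiment)
pct = expression_percentile(norm)
print(norm)
print(pct)
```

Dividing by sequence length compensates for longer proteins yielding more peptides, and scaling by the maximum keeps values comparable within a single experiment.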

Meta-data 

A major problem when accessing public data is a lack of 
specificity from data providers about experimental proto- 
cols. To prevent this frustration, MOPED requires a 
minimum amount of meta-data that must be included 
with each dataset. At the experiment level, users must 
supply a brief experimental description, the source 
organism from the NCBI taxon database and any applic- 
able journal references (23). Additionally, each protein 
identification is associated with a tissue, localization and 
condition which align with the BRENDA Tissue 
Ontology, Cell Type Ontology and Disease Ontology, 
respectively (24-26). 
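
The minimum meta-data described above could be represented along the following lines. This is only a sketch: the class and field names are hypothetical, and the check merely tests that required values are present; it does not validate terms against the ontologies themselves.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentMetadata:
    description: str    # brief experimental description
    ncbi_taxon_id: int  # source organism, e.g. 9606 for Homo sapiens
    references: List[str] = field(default_factory=list)  # journal references, if any


@dataclass
class IdentificationContext:
    tissue: str        # expected to align with the BRENDA Tissue Ontology
    localization: str  # expected to align with the Cell Type Ontology
    condition: str     # expected to align with the Disease Ontology


def missing_fields(meta: ExperimentMetadata, ctx: IdentificationContext) -> List[str]:
    """Return the names of required fields that are empty or invalid."""
    problems = []
    if not meta.description.strip():
        problems.append("description")
    if meta.ncbi_taxon_id <= 0:
        problems.append("ncbi_taxon_id")
    for name in ("tissue", "localization", "condition"):
        if not getattr(ctx, name).strip():
            problems.append(name)
    return problems


meta = ExperimentMetadata("Proteomics study of a human embryonic kidney cell line", 9606)
ctx = IdentificationContext("Kidney", "Embryonic Cell Line", "Standard")
print(missing_fields(meta, ctx))
```

Journal references are optional in this sketch because the text asks only for "any applicable" references.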

Organisms 

MOPED contains information on both humans and 
model organisms. Not only does studying model organ- 
isms increase our understanding of biological systems, but 
also studies of model organisms can inform our know- 
ledge of homologous systems in humans and other 
species (27). Thus far, MOPED contains data from four 
of the most studied organisms: Homo sapiens (human), 
Mus musculus (mouse), Caenorhabditis elegans (worm) 
and Saccharomyces cerevisiae (yeast). 

Protein information 

To maximize information content, MOPED has been built 
to link out to many of the most popular and useful data 
resources. In terms of protein identifiers, MOPED has 
universal links to the heavily utilized UniProt and NCBI 
databases and organism-specific links to the authoritative 
WormBase and Saccharomyces Genome Database 
(28-31). A symbiotic relationship has been established
whereby MOPED links to GeneCards and GeneCards
displays MOPED's data (32). MOPED contains an innovative
database that extends coverage of proteins to
pathway databases (KEGG, Reactome, MetaCyc,
PANTHER and SEED) using orthologous groups of
proteins specified by both the aforementioned pathway
databases and eggNOG (33-38). In total, MOPED links
to 10 external databases.

Table 2. Release statistics as of 10 November 2011

Species                             Proteins with at least    Proteins with    High confidence
                                    one spectral match        <1% FDR          spectra
Homo sapiens (human)                15 847                    6102             3 906 048
Mus musculus (mouse)                10 308                    5935             2 650 237
Caenorhabditis elegans (worm)       10 922                    7383             1 979 744
Saccharomyces cerevisiae (yeast)    6717                      3747             2 809 390
Total                               43 794                    23 167           11 345 419

Release statistics 

As of 10 November 2011, MOPED contains 43 794 
proteins with at least one high certainty spectral match, 
23 167 proteins with an FDR<1% and more than 11 
million spectra (39). These data come from 35 experiments 
on 4 organisms covering 13 tissues, 21 localizations and 
10 conditions. Organism-specific release statistics are in 
Table 2. In addition to individual experiments, the 
database also contains meta-analyses of yeast and worm 
data based upon the recently published approach to 
meta-analysis (18). 
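
The release totals quoted above can be checked against the per-organism figures in Table 2 with a few lines of arithmetic. The dictionary layout below is only an illustrative arrangement of the table values.

```python
# Per-organism values from Table 2:
# (proteins with >= 1 spectral match, proteins with < 1% FDR, high confidence spectra)
table2 = {
    "Homo sapiens": (15847, 6102, 3906048),
    "Mus musculus": (10308, 5935, 2650237),
    "Caenorhabditis elegans": (10922, 7383, 1979744),
    "Saccharomyces cerevisiae": (6717, 3747, 2809390),
}

# Sum each column across the four organisms.
totals = tuple(sum(column) for column in zip(*table2.values()))
print(totals)  # (43794, 23167, 11345419)
```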

USER INTERFACE 

MOPED front page 

The MOPED front page (http://moped.proteinspire.org) 
provides a description of the MOPED resource and 
contains tabs to access database search, upload data and 
view help files. 

MOPED search view 

MOPED's access point to proteomics data is located in 
the 'Search' tab. From this view, users are able to access 
the entirety of MOPED's expression database (Figure 1, 
top). Protein expression data can be both browsed by 
categories such as organism, tissue and localization and 
queried by protein ID and keywords. After the user has 
selected filters, clicking the 'Search' button quickly renders 
all matching expression data points and associated 
meta-data. Most of the search view is dominated by the 
'Protein ID and Expression Summary' section which 
displays expression data resulting from the user's query. 
Each row in the expression summary table displays all 
statistical information contained in Table 1, as well as 
experimental meta-data. Complete protein annotations 
can be viewed by hovering over either the protein IDs or 
partial annotations. The set of meta-data corresponding
to all displayed expression information is summarized
under the separate 'Experiment Summaries' table. The filtering
capabilities at the top of the MOPED interface's
Search tab allow users to query on these different
experiments.
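
The browse-and-sort behavior of the search view can be pictured with a small in-memory sketch. The record type, field names and `search` function are hypothetical and say nothing about how MOPED is implemented; the example rows are illustrative records modeled on the kind of data shown in Figure 1.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ExpressionRecord:
    protein_id: str
    organism: str
    tissue: str
    localization: str
    condition: str
    expression_percentile: float
    fdr: float


def search(
    records: Iterable[ExpressionRecord],
    organism: Optional[str] = None,
    tissue: Optional[str] = None,
    localization: Optional[str] = None,
    condition: Optional[str] = None,
    sort_by: str = "expression_percentile",
    descending: bool = True,
) -> List[ExpressionRecord]:
    """Keep records matching every filter that is set, then sort them."""
    wanted = {
        "organism": organism,
        "tissue": tissue,
        "localization": localization,
        "condition": condition,
    }
    hits = [
        r for r in records
        if all(value is None or getattr(r, name) == value for name, value in wanted.items())
    ]
    return sorted(hits, key=lambda r: getattr(r, sort_by), reverse=descending)


records = [
    ExpressionRecord("P06733", "Homo sapiens", "Blood", "Monocyte", "COPD", 98.40, 0.000),
    ExpressionRecord("P06733", "Homo sapiens", "Kidney", "Embryonic Cell Line", "Standard", 99.99, 0.000),
    ExpressionRecord("P33778", "Homo sapiens", "Kidney", "Embryonic Cell Line", "Standard", 100.00, 0.003),
]
kidney = search(records, tissue="Kidney")
print([r.protein_id for r in kidney])
```

Passing `sort_by="fdr", descending=False` would instead list the most confident identifications first.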

MOPED protein view 

Clicking on a protein ID from any tab allows the user to 
open a page containing all stored information related 
to that protein, including the protein annotation, links 
to protein and pathway databases and identifications of 
that protein in other MOPED experiments (Figure 1, 
bottom). 

The primary advantage of MOPED's protein view over 
other databases is the presentation of expression data 
from many experiments side by side. On the protein 
page, MOPED automatically displays the expression in- 
formation for that protein in every single experiment 
contained in MOPED (Figure 1, bottom). Ideally, this 
information will enable the user to identify meaningful 
expression patterns across different conditions. The same 
expression information has been incorporated with both 
GeneCards (human data only) and SPIRE (32,21). 

MOPED upload 

Through the upload tab, users can compare their experi- 
mental data with the data contained in the MOPED 
servers. User upload of data automatically filters 
MOPED data to display only those proteins which were 
identified in the user's experiment. For identification-only
queries, users are able to upload a list of UniProt protein
identifiers. For expression-based queries, users may
upload UniProt protein identifiers, expression and FDR 
values and condition names. Once this information has 
been uploaded, the user can experiment with several 
functionalities in the Upload tab (Figure 2). MOPED 
displays the data for proteins identified in both the 
user's experiment and experiments in the MOPED 
servers. These data may be interrogated in the same 
manner as the MOPED search page. For identification 
visualization, MOPED separates user data based on con- 
dition and generates overlap plots of the identifications 
with dynamic thresholding by protein FDR (Figure 3). 
For expression visualization, MOPED dynamically 
generates heatmaps of the user-uploaded data with user- 
specified expression value thresholding (Figure 4). 
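
To make the upload and overlap steps concrete, the sketch below parses an expression upload and groups identifications by condition under a protein FDR threshold. It assumes a file whose first line is a header and whose columns are protein identifier, expression, FDR and condition, separated by tabs, spaces, commas or semicolons, as the upload page describes. The function names are hypothetical, and condition names containing spaces are not handled.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, float, float, str]  # protein ID, expression, FDR, condition


def parse_upload(text: str) -> List[Row]:
    """Parse an uploaded table, skipping the header line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # the first line is treated as a header
        protein, expression, fdr, condition = re.split(r"[\t ,;]+", line)[:4]
        rows.append((protein, float(expression), float(fdr), condition))
    return rows


def identifications_by_condition(rows: List[Row], max_fdr: float) -> Dict[str, Set[str]]:
    """Group protein IDs by condition, keeping only those with FDR <= max_fdr."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for protein, _expression, fdr, condition in rows:
        if fdr <= max_fdr:
            groups[condition].add(protein)
    return dict(groups)


# Hypothetical upload mixing the accepted delimiters.
upload = """protein\texpression\tfdr\tcondition
P06733\t0.76\t0.000\tcancer
P60709,0.53,0.004,cancer
P06733;0.69;0.002;control
P62258 0.51 0.050 control
"""
rows = parse_upload(upload)
groups = identifications_by_condition(rows, max_fdr=0.01)
overlap = groups["cancer"] & groups["control"]
print(sorted(overlap))
```

Changing `max_fdr` corresponds to the dynamic thresholding of the overlap plot: a stricter threshold shrinks each condition's set and therefore the overlap.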

MOPED documentation 

MOPED provides a comprehensive help file and a tutorial 
example to clarify the usage and highlight its features. 
This documentation is accessible under the Help tab and 
comes in the form of two pdf files. The tutorial contains 
real data examples. 

FUTURE DIRECTIONS 

Increased data and public data submission 

MOPED is currently involved in a number of collaborations
that will dramatically increase the amount of
proteomics data available. Though all MOPED data are
currently loaded in-house, work is in progress to create an
interface for public submission of proteomics expression
data. Users will be able to fulfill publication and grant
requirements for data preservation by uploading their
datasets to MOPED. Researchers interested in submitting
their data are invited to contact the MOPED team at
moped@proteinspire.org. In addition to increasing the
number of protein identification experiments, MOPED
plans to utilize data from relative expression experiments,
providing users with expression ratios and statistical significance
for many different condition comparisons.

[Figure 1 screenshot: search view filtered to Kidney, showing the 'Experiment Summaries' table (experiment tanner_kidney; Tanner et al. (2007) Genome Res., PMID:17189379) and the 'Protein ID and Expression Summary' table with columns Protein ID, Expression %, Normalized Expression, FDR Error, Condition, Tissue/Cell Type, Localization, Spectral Counts, Unique Peptides, Sequence Coverage %, Experiment and Description; protein view for P06733 with external protein/gene links (UniProt, RefSeq, GI, GeneCards), pathway links (MetaCyc, Reactome, KEGG, PANTHER, SEED) and expression data for the protein across experiments.]

Figure 1. MOPED views. The main MOPED view, on top and the protein view, on bottom. Clicking on links for an identified protein in the main
MOPED view brings up the protein view. In this example, P06733 has been selected from the main MOPED view.

[Figure 2 screenshot text: 'Upload a file containing a list of proteins or a delimited (by tabs, spaces, commas or semicolons) list of proteins, expression, local FDR and condition. The first line will be treated as a header. You can download a template for the format and fill in the data. After uploading the file, click the display button to find the data.' Controls shown: Experiment Summaries And Details; Overlap Plot with 'FDR/Error less than or equal to: .01'; Heatmap with 'Expression greater than: 0.00'.]

Figure 2. Upload tab. Users may upload their own data through the upload tab. These data can then be visualized by clicking any of the 'Generate'
links under their associated functionalities. Experiment summaries and details create a view at the bottom of the screen akin to the view in Figure 1.
The overlap plot and heatmap views are seen in Figure 3 and Figure 4, respectively.

[Figure 3 and Figure 4 images: intersection diagram; heatmap with color key (row Z-score).]

Figure 3. Overlap plot. An overlap plot generated for data from Ref.
(42) with two conditions, cancer and control.

Increased visualization 

MOPED remains under continuous development to improve
all components of the user experience. Work is
underway to develop a plug-in for Cytoscape
that provides pathway-level visualization of the experimental
data residing in MOPED (40). The goal
is to maximize the user's knowledge of fluctuating patterns
of pathway regulation (Supplementary Figure S5).
Additionally, scripts are being developed to dynamically
visualize experimental expression relative to the MOPED
experiments (Supplementary Figure S6).

Figure 4. Heatmap. A heatmap generated for data from Ref.
(42) with two conditions, cancer and control.

Integration of other omics data 

While proteomics data provide comprehensive insight
into cellular mechanisms at the protein level, combining
proteomics knowledge with other omics disciplines stands
to yield a more complete understanding of complex
biological systems. Metabolomics, transcriptomics, 
lipidomics and genomics are notable disciplines for 
which integrated analysis with proteomics is a natural 
extension. For example, proteomics data from MOPED 
could be linked with transcriptomics data from GEO 
for common organ, tissue, localization and condition 
combinations (41). 

DISCUSSION 

Currently, proteomics datasets are either scattered 
throughout individual data repositories or trapped 
within labs' own databases. Knowledge discovery is 
often obscured by bulky datasets, non-standard formats, 
missing meta-data and limited access to data. MOPED 
presents a solution which addresses these challenges. 
MOPED provides essential statistical summaries and a 
number of query and visualization tools to relate the 
findings to those observed in other experiments. Patterns 
of expression within and across sample sets can be 
visualized, proteins of interest can be directly queried 
and condition-specific expression data can be browsed. 
As a community resource, MOPED will increase reliable
data proliferation and make analysis more comprehensive.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Figures 5 and 6. 